1 Introduction

[1] RoBERTa: A Robustly Optimized BERT Pretraining Approach
Link: http://arxiv.org/abs/1907.11692
Institute: University of Washington, Facebook AI
Code: https://github.com/pytorch/fairseq

1.1 Achievement

  1. Present a replication study of BERT pretraining, carefully measuring the impact of key hyperparameters and training data size
  2. Find that BERT was significantly undertrained and can be further improved
  3. Achieve state-of-the-art on GLUE, RACE and SQuAD

2 Method

For details of the original BERT model, see the BERT paper.
The following sections introduce the modifications the authors propose, each motivated by an ablation experiment.

2.1 Static vs. Dynamic Masking

Table 2.1 Static vs. Dynamic Masking

  BERT:     Static Mask
  Method 1: Dynamic Mask

BERT: (Static Mask)

  1. The masking pattern is generated only once, during data preprocessing, so every epoch trains on the same masked positions
  2. This limits the diversity of the training signal (information loss); in the replication, the data is duplicated 10 times so that each sequence is masked in 10 different ways over the 40 training epochs, i.e. each mask is still seen 4 times

Method 1: (Dynamic Mask)

  1. A new masking pattern is generated every time a sequence is fed to the model
  2. Reduces the effect of information loss, which matters when pretraining for more steps or on larger datasets
    * A minimal sketch of the two strategies follows at the end of this subsection.

Table 2.2 Comparison between Static and Dynamic Masking

Result:

  1. The re-implementation with static masking performs similarly to the original BERT model, and dynamic masking is comparable to or slightly better than static masking
  2. Dynamic masking is therefore chosen for RoBERTa
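
A minimal sketch of the two masking strategies, assuming a toy whitespace tokenizer; the 80%/10%/10% replacement rule is the one described in the BERT paper, while the function name, toy vocabulary, and sentence are purely illustrative.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """BERT-style masking: each selected position is replaced by [MASK]
    80% of the time, by a random token 10%, or left unchanged 10%."""
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                 # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                out[i] = MASK
            elif roll < 0.9:
                out[i] = rng.choice(VOCAB)
            # else: keep the original token
    return out, labels

sequence = ["the", "cat", "sat", "on", "the", "mat"]

# Static masking (BERT): computed once at preprocessing time and reused,
# so every epoch trains on exactly the same masked positions.
static_masked = mask_tokens(sequence)
for epoch in range(3):
    masked, labels = static_masked          # identical in every epoch

# Dynamic masking (RoBERTa): a fresh mask is generated each time the
# sequence is fed to the model, so the masked positions vary per epoch.
for epoch in range(3):
    masked, labels = mask_tokens(sequence)  # new positions every epoch
```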

2.2 Model Input Format and Next Sentence Prediction

Table 2.3 Training Input Formats

  BERT:     SEGMENT-PAIR+NSP
  Method 1: SENTENCE-PAIR+NSP
  Method 2: FULL-SENTENCES
  Method 3: DOC-SENTENCES

BERT: (SEGMENT-PAIR+NSP)

  1. Input with a pair of segments
    * Each segment can contain multiple sentences but the maximum token length of the input is 512.
  2. Train the model with NSP loss

Method 1: (SENTENCE-PAIR+NSP)

  1. Input with a pair of sentences
    * Since individual sentences are much shorter than 512 tokens, the batch size is increased so that the total number of tokens per batch is similar to BERT (SEGMENT-PAIR+NSP).
  2. Train the model with NSP loss

Method 2: (FULL-SENTENCES)

  1. Each input is packed with full sentences sampled contiguously from one or more documents, so inputs may cross document boundaries
    * The total length is at most 512 tokens.
  2. When one document ends, an extra separator token is added and sampling continues with the sentences of the next document
    * This is how document boundaries are handled (see the packing sketch at the end of this subsection).
  3. Remove the NSP loss

Method 3: (DOC-SENTENCES)

  1. Inputs are constructed like FULL-SENTENCES, except that they may not cross document boundaries (sentences are sampled contiguously from a single document)
  2. Inputs sampled near the end of a document may be much shorter than 512 tokens because of the cut-off at the boundary
  3. The batch size is dynamically increased in these cases to keep the total number of tokens per batch similar to Method 2 (FULL-SENTENCES)
  4. Remove the NSP loss

Table 2.4 Comparison between Different Input Formatting Methods

Result:

  1. Using individual sentences hurts performance on downstream tasks
    * The authors hypothesize that the model is not able to learn long-range dependencies from single-sentence inputs.
  2. Removing the NSP loss matches or slightly improves downstream task performance
  3. Method 3 (DOC-SENTENCES) performs slightly better than Method 2 (FULL-SENTENCES)
  4. However, to avoid the variable batch sizes that DOC-SENTENCES requires, they finally choose Method 2 (FULL-SENTENCES) for RoBERTa
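
A minimal sketch of the FULL-SENTENCES packing scheme under simplifying assumptions: documents are lists of pre-tokenized sentences, whitespace splitting stands in for real BPE tokenization, and `</s>` stands in for the document separator token. The helper name and toy documents are illustrative, not the fairseq implementation.

```python
MAX_LEN = 512
SEP = "</s>"  # stands in for the separator token added between documents

def pack_full_sentences(documents, max_len=MAX_LEN):
    """Pack sentences contiguously into inputs of at most max_len tokens.
    Inputs may cross document boundaries; a separator marks the end of each
    document (the FULL-SENTENCES setting, trained without the NSP loss).
    Assumes no single sentence is longer than max_len."""
    inputs, current = [], []
    for doc in documents:
        for sentence in doc:
            tokens = sentence.split()        # toy whitespace "tokenization"
            if len(current) + len(tokens) > max_len:
                inputs.append(current)       # current input is full; start anew
                current = []
            current.extend(tokens)
        if len(current) + 1 <= max_len:      # document finished: add separator
            current.append(SEP)              # and keep packing the next document
    if current:
        inputs.append(current)
    return inputs

docs = [["a short document ."], ["another document .", "with two sentences ."]]
for packed in pack_full_sentences(docs, max_len=8):
    print(packed)
```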

2.3 Training with Large Batches

In this section, the authors investigate whether training with larger mini-batches (together with a correspondingly tuned learning rate) improves results. The comparison keeps computational cost roughly constant: for example, 1M steps with a batch size of 256 sequences costs about the same as 125K steps with a batch size of 2K, or 31K steps with a batch size of 8K.

Table 2.5 Comparison between Different Batch Size

Result:

  1. Training with sufficiently large batches improves both the masked-language-modelling perplexity on held-out data and end-task accuracy.
  2. Large batches are also easier to parallelize via distributed data-parallel training, and can be simulated without large-scale hardware through gradient accumulation, as sketched below.
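
The paper notes that large-batch training is possible even without large-scale parallel hardware via gradient accumulation, in which gradients from several mini-batches are accumulated locally before each optimizer step (supported natively in fairseq). A minimal PyTorch sketch of the idea; the model, data, and hyperparameters below are placeholders, not the actual pretraining setup.

```python
import torch
import torch.nn as nn

# Placeholder model and data; in practice this would be the masked-LM model
# and mini-batches of token IDs.
model = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

accumulation_steps = 8   # effective batch = 8 x the per-step mini-batch size
loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(32)]

optimizer.zero_grad()
for step, (x, y) in enumerate(loader, start=1):
    loss = criterion(model(x), y) / accumulation_steps  # scale so the summed
    loss.backward()                                     # gradients match one
    if step % accumulation_steps == 0:                  # large-batch update
        optimizer.step()
        optimizer.zero_grad()
```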

2.4 Text Encoding

In this section, they change the text encoding: instead of BERT's character-level BPE vocabulary of 30K units, learned after heuristic tokenization of the input, RoBERTa uses GPT-2's byte-level BPE with a vocabulary of about 50K subword units, which requires no additional preprocessing and can encode any input text without unknown tokens. Early experiments showed only minor end-task differences between the two encodings, but the authors prefer the universal byte-level scheme.
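
A tiny illustration of why a byte-level vocabulary is universal: the base symbols are the 256 possible byte values, so any Unicode string decomposes into known units and no unknown-token symbol is ever needed (the BPE merges learned on top of the bytes are omitted here).

```python
# Any text, including accents and emoji, maps onto the 256 base byte symbols.
text = "RoBERTa café 🙂"
byte_symbols = list(text.encode("utf-8"))
print(byte_symbols)                         # values are all in range(256)
print(bytes(byte_symbols).decode("utf-8"))  # lossless round trip back to text
```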

3 RoBERTa

Table 3.1 Comparison between RoBERTa, BERT_LARGE and XLNet_LARGE

RoBERTa: (Robustly optimized BERT approach)

  1. RoBERTa, a modified version of BERT, is trained with dynamic masking (2.1), FULL-SENTENCES without the NSP loss (2.2), large mini-batches (2.3), and a larger byte-level BPE vocabulary (2.4).
  2. Experiment settings:
    2.1. Based on the BERT_LARGE architecture (L = 24, H = 1024, A = 16, 355M parameters)
    2.2. Pretrained with 1024 Tesla V100 GPUs for approximately one day
  3. Experiment results:
    3.1. Large improvement over the originally reported BERT results
    3.2. The final configuration, trained longer and over more data, achieves state-of-the-art results on GLUE, RACE and SQuAD, outperforming BERT_LARGE and XLNet_LARGE (a usage sketch of the released model follows below).
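
For reference, a short usage sketch of the released checkpoint through fairseq's torch.hub interface, following the entry points documented in the repository linked above; exact names and availability may differ between fairseq versions.

```python
import torch

# Load the pretrained RoBERTa-large checkpoint via torch.hub (downloads weights).
roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()  # disable dropout for deterministic feature extraction

# Encode a sentence with the byte-level BPE and extract contextual features.
tokens = roberta.encode('RoBERTa is a robustly optimized BERT.')
features = roberta.extract_features(tokens)  # shape: (1, num_tokens, 1024)
print(features.shape)
```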

4 Conclusion

What may improve the performance of BERT:

  1. Training the model longer, with bigger batches, over more data
  2. Removing the NSP objective
  3. Training on longer sequences
  4. Dynamically changing the masking pattern applied to the training data
